Abstract. Estimating entropies from limited data series is known to be a non-trivial
task. Naive estimations are plagued with both systematic (bias) and statistical errors. 
Here, we present a new "balanced estimator" for entropy functionals (Shannon, Rényi
and Tsallis) specifically devised to provide a compromise between low bias and small
statistical errors for short data series. This new estimator outperforms other currently
available ones when the data sets are small and the probabilities of the possible outputs
of the random variable are not close to zero. Otherwise, other well-known estimators
remain a better choice. The potential range of applicability of this estimator is quite
broad, especially for biological and digital data series.

1. Introduction



In statistical mechanics and information theory, entropy is a functional that measures 
the information content of a statistical ensemble or equivalently the uncertainty of a 
random variable. Its applications in physics, biology, computer science, linguistics, etc.
are countless. For example, it has become a key tool in data mining tasks arising from
high-throughput biological analyses. 

Historically, the most important example of such a functional is the Shannon
(or information) entropy [1, 2]. For a discrete random variable x, which can take a
finite number, M, of possible values $x_i \in \{x_1, \ldots, x_M\}$ with corresponding probabilities
$p_i \in \{p_1, \ldots, p_M\}$, this entropy is defined by:

$$ H_S = -\sum_{i=1}^{M} p_i \ln(p_i). \tag{1} $$

Recently, various generalizations, inspired by the study of q-deformed algebras and
special functions, have been investigated, most notably the Rényi entropy [3]:

$$ H_R(q) = \frac{1}{1-q} \ln\left( \sum_{i=1}^{M} p_i^q \right), \tag{2} $$

with $q > 0$, which, in particular, reduces to the Shannon entropy in the limit $q \to 1$.
Also, the Tsallis entropy [4]:

$$ H_T(q) = \frac{1}{q-1} \left( 1 - \sum_{i=1}^{M} p_i^q \right), \tag{3} $$

although controversial, has generated a large burst of research activity.

In general, the full probability distribution for a given stochastic problem is not 
known and, in particular, in many situations only small data sets from which to infer 
entropies are available. For example, it could be of interest to determine the Shannon 
entropy of a given DNA sequence. In such a case, one could estimate the probability of
each element i to occur, $p_i$, by making some assumption on the probability distribution,
as for example (i) parametrizing it [5], (ii) dropping the most unlikely values [6] or
(iii) assuming some a priori shape for the probability distribution [7, 8]. However, the
easiest and most objective way to estimate them is just by counting how often the value
$x_i$ appears in the data set [9, 10, 11, 12, 13, 14, 15]. Denoting this number by $n_i$ and
dividing by the total size N of the data set one obtains the relative frequency:

$$ \hat{p}_i = \frac{n_i}{N}, \tag{4} $$

which naively approximates the probability $p_i$. Obviously, the entropy of the data set can
be approximated by simply replacing the probabilities $p_i$ by $\hat{p}_i$ in the entropy functional.
For example, the Shannon entropy can be estimated by:

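$$ \hat{H}_S^{\mathrm{naive}} = -\sum_{i=1}^{M} \hat{p}_i \ln(\hat{p}_i) = -\sum_{i=1}^{M} \frac{n_i}{N} \ln\left( \frac{n_i}{N} \right). \tag{5} $$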



The quantity $\hat{H}_S^{\mathrm{naive}}$ is an example of an estimator of the entropy, in a very similar
sense as $\hat{p}_i$ is an estimator of $p_i$. However, there is an important difference stemming
from the non-linear nature of the entropy functional. The frequencies $\hat{p}_i$ are unbiased
estimators of the probabilities, i.e., their expectation value $\langle \hat{p}_i \rangle$ (where $\langle \cdot \rangle$ stands for
ensemble averages) coincides with the true value of the estimated quantity:

$$ \langle \hat{p}_i \rangle = \frac{\langle n_i \rangle}{N} = p_i. \tag{6} $$

In other words, the frequencies $\hat{p}_i$ approximate the probabilities $p_i$ with a certain statistical
error (variance) but without any systematic error (bias). In contrast, naive entropy
estimators, such as $\hat{H}_S^{\mathrm{naive}}$, in which the $p_i$ are simply replaced by $n_i/N$, are always
biased, i.e. they deviate from the true value of the entropy not only statistically but
also systematically. Actually, defining an error variable $\epsilon_i = (\hat{p}_i - p_i)/p_i$, and replacing
$p_i$ in Eq. (1) by its value in terms of $\epsilon_i$ and $\hat{p}_i$, it is straightforward to verify that the
bias, up to leading order, is $-\frac{M-1}{2N}$, which is a significant error for small N and vanishes
only as $N \to \infty$ [12]. A similar bias, owing in general to the nonlinearity of the entropy
functional, appears also for the Rényi and Tsallis entropies.
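
The size of this bias can be checked directly on a small example. The following Python sketch (function names are illustrative and not taken from the original work) computes the exact expectation value of the naive Shannon estimator for a binary source (M = 2) by summing over the binomial distribution of the counts, and prints the resulting bias next to the leading-order value -(M-1)/(2N):

```python
import math


def binom_pmf(n, N, p):
    """Binomial probability of observing n hits in N trials."""
    return math.comb(N, n) * p**n * (1.0 - p) ** (N - n)


def h(x):
    """Shannon summand -x ln x, with the convention 0 ln 0 = 0."""
    return -x * math.log(x) if x > 0.0 else 0.0


def naive_bias_binary(N, p):
    """Exact bias <H_naive> - H of the naive estimator for a binary source."""
    true_entropy = h(p) + h(1.0 - p)
    expected = sum(
        binom_pmf(n, N, p) * (h(n / N) + h((N - n) / N))
        for n in range(N + 1)
    )
    return expected - true_entropy


if __name__ == "__main__":
    N, M = 20, 2
    for p in (0.2, 0.35, 0.5):
        # exact bias versus the leading-order prediction
        print(p, naive_bias_binary(N, p), -(M - 1) / (2 * N))
```

The exact bias is negative for every p strictly between 0 and 1, as expected from the concavity of the entropy, and for N = 20, p = 0.5 it lies close to the leading-order value; the remaining difference comes from higher orders in 1/N.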

Therefore, the question arises whether it is possible to find improved estimators 
which reduce either the bias or the variance of the estimate. More generally, the problem 
can be formulated as follows. Given an arbitrary entropy functional of the form: 



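$$ H = F\left[ \sum_{i=1}^{M} h(p_i) \right] \tag{7} $$

(where F is a generic function) we want to find an estimator

$$ \hat{H} = F\left[ \sum_{i=1}^{M} \chi_{n_i} \right] \tag{8} $$

such that the bias

$$ \Delta = \langle \hat{H} \rangle - H \tag{9} $$

or the mean squared deviation (the statistical error)

$$ \sigma^2 = \left\langle \left( \hat{H} - \langle \hat{H} \rangle \right)^2 \right\rangle \tag{10} $$

or a combination of both is as small as possible. At the very end of such a calculation
the estimator is defined by N + 1 real numbers $\chi_{n_i}$ [13], which depend on the sample
size N. For example, the naive estimator for the Shannon entropy would be given in
terms of

$$ \chi_{n_i}^{\mathrm{naive}} = -\frac{n_i}{N} \ln\left( \frac{n_i}{N} \right). \tag{11} $$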

The search for improved estimators has a long history. To the best of our knowledge, 
the first to address this question was Miller in 1955 [14], who suggested a correction to
reduce the bias of the estimate of Shannon entropy, given by:

$$ \chi_{n_i}^{\mathrm{Miller}} = -\frac{n_i}{N} \ln\left( \frac{n_i}{N} \right) + \frac{1}{2N}. \tag{12} $$

The correction exactly compensates the leading order of the bias, as reported above. In
this case the remaining bias vanishes as $1/N^2$ as $N \to \infty$. This result was improved
by Harris in 1975 [15], who calculated the next-to-leading order correction. However, his
estimator depends explicitly on the (unknown) probabilities $p_i$, so that its practical
importance is limited.

In another pioneering paper, Grassberger, elaborating upon previous work by
Herzel [16], proposed an estimator which provides further improvement and gives a
very good compromise between bias and statistical error [9]. For the Shannon entropy
his estimator is given by

$$ \chi_{n_i}^{\mathrm{Grassberger}} = \frac{n_i}{N} \left( \ln N - \psi(n_i) - \frac{(-1)^{n_i}}{n_i (n_i + 1)} \right), \tag{13} $$

where $\psi(x)$ is the derivative of the logarithm of the Gamma function, valid for all i with
$n_i > 0$. According to [9], the function $\psi(x)$ can be approximated by

$$ \psi(x) \approx \ln x - \frac{1}{2x} \tag{14} $$

for large x, giving

$$ \chi_{n_i}^{\mathrm{Grassberger}} \approx -\frac{n_i}{N} \ln\left( \frac{n_i}{N} \right) + \frac{1}{2N} - \frac{(-1)^{n_i}}{N (n_i + 1)}. \tag{15} $$

This method can be generalized to q-deformed entropies.
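
For readers who want to follow the step from Eq. (13) to Eq. (15), the substitution of the large-x approximation (14) can be written out as below. This is only a sketch of the intermediate algebra, assuming the forms of Eqs. (13) and (14) quoted above:

```latex
\begin{align*}
\chi_{n_i}^{\mathrm{Grassberger}}
  &= \frac{n_i}{N}\left(\ln N - \psi(n_i) - \frac{(-1)^{n_i}}{n_i(n_i+1)}\right) \\
  &\approx \frac{n_i}{N}\left(\ln N - \ln n_i + \frac{1}{2n_i}
      - \frac{(-1)^{n_i}}{n_i(n_i+1)}\right) \\
  &= -\frac{n_i}{N}\ln\left(\frac{n_i}{N}\right) + \frac{1}{2N}
      - \frac{(-1)^{n_i}}{N(n_i+1)} .
\end{align*}
```

The first two terms of the last line coincide with Miller's corrected summand, so in this limit the estimator differs from Miller's only by the alternating term.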

More recently, a further improvement for the Shannon case has been suggested by 
Grassberger [10]:

$$ \chi_{n_i}^{\mathrm{GS}} = \frac{n_i}{N} \left[ \psi(N) - \psi(n_i) - (-1)^{n_i} \int_0^1 \frac{t^{n_i - 1}}{1 + t}\, dt \right]. \tag{16} $$

This estimator can be recast (see Eqs. (28), (29), (35) of Ref. [10]) as

$$ \chi_{n_i}^{\mathrm{GS}} = \frac{n_i}{N} \left( \ln N - G_{n_i} \right), \tag{17} $$

where the $G_n$ satisfy the recurrence relation

$$ G_1 = -\gamma - \ln 2 \tag{18} $$
$$ G_2 = 2 - \gamma - \ln 2 \tag{19} $$
$$ G_{2n+1} = G_{2n} \tag{20} $$
$$ G_{2n+2} = G_{2n} + \frac{2}{2n+1} \tag{21} $$

with $\gamma = -\psi(1)$. This estimator constitutes the state of the art for Shannon entropies,
but unfortunately, it cannot be straightforwardly extended to more general q-deformed
entropy functionals, for which [9] remains the best available option. These results were
further generalized by Schürmann [11] with different balances between statistical and
systematic errors.
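
The recurrence for the $G_n$ is straightforward to tabulate. The following Python sketch is one possible implementation, assuming that γ is the Euler-Mascheroni constant and that states with zero counts contribute nothing to the sum; the function names are illustrative only:

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant, gamma = -psi(1)


def grassberger_G(n_max):
    """Return a list G with G[n] for n = 1..n_max (G[0] is an unused placeholder)."""
    G = [0.0] * (n_max + 1)
    if n_max >= 1:
        G[1] = -EULER_GAMMA - math.log(2.0)
    if n_max >= 2:
        G[2] = 2.0 - EULER_GAMMA - math.log(2.0)
    for k in range(3, n_max + 1):
        if k % 2 == 1:
            G[k] = G[k - 1]                    # G_{2n+1} = G_{2n}
        else:
            G[k] = G[k - 2] + 2.0 / (k - 1)    # G_{2n+2} = G_{2n} + 2/(2n+1)
    return G


def grassberger_shannon_entropy(counts):
    """Shannon estimate sum_i (n_i/N) (ln N - G_{n_i}) over states with n_i > 0."""
    N = sum(counts)
    G = grassberger_G(max(counts))
    return sum(n / N * (math.log(N) - G[n]) for n in counts if n > 0)


if __name__ == "__main__":
    print(grassberger_shannon_entropy([12, 8]))
```

For large even n the tabulated values approach ln n, which is why the estimator reduces to the naive one when all counts are large.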

It should be emphasized that an ideal estimator does not exist; instead, the choice
of the estimator depends on the structure of the data to be analyzed [12]. For example, the
above discussed estimators [9, 10] work satisfactorily if the probabilities $p_i$ are sufficiently
small. This is the case in many applications of statistical physics, where the number of
possible states, M, in an ensemble is usually extremely large so that the probability $p_i$
for an individual state i is very small. On the other hand, this assumption does not
always hold for empirical data sets such as digital data streams and DNA sequences.

The performance of the estimators worsens as the values of $p_i$ get larger. This is
due to the following reason: the numbers $n_i$, which count how often the value $x_i$ appears
in the data set, are generically distributed as binomials, i.e. the probability $P_{n_i}$ to find
the value $x_i$ exactly $n_i$ times is given by:

$$ P_{n_i}(p_i) = \binom{N}{n_i} p_i^{n_i} (1 - p_i)^{N - n_i}, \tag{22} $$

where $\binom{N}{n_i} = \frac{N!}{n_i!\,(N - n_i)!}$ are binomial coefficients. For $p_i \ll 1$ this can be
approximated by a Poisson distribution, which is the basis for the derivation of Eq. (13).
For large values of $p_i$, however, this assumption is no longer justified and this results in
large fluctuations (even if the bias remains small).
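
How quickly the Poisson approximation degrades with growing $p_i$ can be illustrated numerically. The sketch below (an illustration only, with hypothetical function names) measures the total variation distance between the binomial distribution of Eq. (22) and a Poisson distribution with the same mean, N p:

```python
import math


def binom_pmf(n, N, p):
    """Binomial probability P_n(p) for n counts out of N."""
    return math.comb(N, n) * p**n * (1.0 - p) ** (N - n)


def poisson_pmf(n, lam):
    """Poisson probability with mean lam > 0, evaluated in log space."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))


def total_variation(N, p, n_max=100):
    """Half the L1 distance between Binomial(N, p) and Poisson(N p).

    The Poisson tail beyond n_max is neglected (an assumption that is
    harmless for the small N considered here).
    """
    lam = N * p
    distance = 0.0
    for n in range(n_max + 1):
        b = binom_pmf(n, N, p) if n <= N else 0.0
        distance += abs(b - poisson_pmf(n, lam))
    return 0.5 * distance


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.5):
        print(p, total_variation(20, p))
```

For N = 20 the distance is tiny at p = 0.01 and much larger at p = 0.5, in line with the statement that the Poisson assumption is only justified for small probabilities.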

It is important to note that it is not possible to design an estimator that minimizes 
both the bias and the variance to arbitrarily small values. The existing studies have 
shown that there is always a delicate tradeoff between the two types of errors. For 
example, minimizing the bias usually comes at the expense of the variance, which 
increases significantly. Moreover, it can be proved that neither the variance nor the bias 
can be reduced to zero for finite N [18]. Therefore, it is necessary to study estimators
with different balances between systematic and statistical errors, as was done e.g. in
the work by Schürmann [11].

In the present work we introduce two estimators, which can be used to measure any
of the entropy functionals discussed above. Both of them are specifically designed for
short data series where the probabilities $p_i$ take (in general) non-small values. The first
one reduces the bias as much as possible at the expense of the variance, and is mostly
of academic interest and discussed only for illustration purposes. The second one seeks
a robust compromise between minimizing bias and variance together, is very easy to
implement numerically, and has a broad potential range of applicability. The estimator
itself can be improved by adapting several of its elements to each specific problem.



2. Low-bias estimator 

The starting point is the observation that the entropy $H$ and its estimators $\hat{H}$ in Eq. (7)
and Eq. (8) involve sums over all possible values of the data set. Therefore, as the bias
can be minimized by minimizing the errors of each summand, the problem can be
reduced to minimizing

$$ \delta(p_i) = \langle \chi_{n_i} \rangle - h(p_i) = \left[ \sum_{n_i=0}^{N} P_{n_i}(p_i)\, \chi_{n_i} \right] - h(p_i) \tag{23} $$

over a broad range of $p_i$ as much as possible.

A theorem by Paninski [18] states that it is impossible to reduce the bias to zero
for all $p_i \in [0, 1]$ since an estimator is always a finite polynomial in $p_i$ while the true
entropy is usually not a polynomial. However, it is possible to let the bias vanish at
N + 1 points $p_i$ in the interval [0, 1] because the determination of the different $\chi_{n_i}$
requires N + 1 independent equations.

Figure 1. Fluctuations, $\sigma^2$, as defined by Eq. (10), for three different Shannon entropy
estimates (the naive one, the improved estimator introduced in [10], and the low-bias
estimator defined in this paper) for a binary sequence (M = 2) of length N = 20.
Inset: Bias, $\Delta$, as defined by Eq. (9), for the low-bias estimator showing the N + 1
vanishing points with amplitude oscillations.

For the sake of illustration, let us choose here equidistant points $p_j = j/N$, with
$j = 0, 1, \ldots, N$. In general, other choices, more appropriate to each specific case, should
be employed. The resulting set of linear equations reads:

$$ \delta(j/N) = 0 \;\Rightarrow\; \sum_{n_i=0}^{N} P_{n_i}(j/N)\, \chi_{n_i} = h(j/N), \qquad j = 0, 1, \ldots, N. \tag{24} $$

Introducing the notation $h_j = h(j/N)$ and $P_{j,n_i} = P_{n_i}(j/N)$ this last expression takes
the form:

$$ \sum_{n_i=0}^{N} P_{j,n_i}\, \chi_{n_i} = h_j, \qquad j = 0, 1, \ldots, N \tag{25} $$

or, in short, $\mathbf{P} \vec{\chi} = \vec{h}$, where $\mathbf{P}$ is the so-called multinomial matrix [19]. To find the
solution $\vec{\chi} = \mathbf{P}^{-1} \vec{h}$, the matrix:

$$ P_{j,n_i} = \binom{N}{n_i} \left( \frac{j}{N} \right)^{n_i} \left( 1 - \frac{j}{N} \right)^{N - n_i}, \tag{26} $$

whose elements are binomial distributions, has to be inverted. For small N this inversion
is most easily done numerically. However, we were also able to invert the matrix
analytically, leading us to the closed form

$$ \sum_{k=0} \sum_{l=0} \ldots \tag{27} $$

where s(l,k) denotes the Stirling numbers of the first kind [21]. Having inverted the
matrix, the numbers $\chi_{n_i}$ determining the estimators can be computed for any given
entropy functional by a simple matrix multiplication.
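
The numerical route mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: it builds the matrix of binomial probabilities on the equidistant mesh $p_j = j/N$ and solves the linear system with exact rational arithmetic (a design choice made here because the matrix becomes badly conditioned as N grows). All function names are hypothetical.

```python
import math
from fractions import Fraction


def shannon_h(p):
    """Shannon summand h(p) = -p ln p, with h(0) = 0."""
    return -p * math.log(p) if p > 0 else 0.0


def binomial_matrix(N):
    """Exact matrix P[j][n] = C(N, n) (j/N)^n (1 - j/N)^(N - n)."""
    rows = []
    for j in range(N + 1):
        p = Fraction(j, N)
        rows.append([math.comb(N, n) * p**n * (1 - p) ** (N - n)
                     for n in range(N + 1)])
    return rows


def solve_exact(A, b):
    """Gauss-Jordan elimination on rational numbers."""
    size = len(A)
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def low_bias_chi(N, h=shannon_h):
    """Numbers chi_n whose expected value equals h(p) at p = j/N, j = 0..N."""
    rhs = [Fraction(h(j / N)) for j in range(N + 1)]
    return [float(x) for x in solve_exact(binomial_matrix(N), rhs)]


def expected_summand(chi, p):
    """<chi_n> under the binomial distribution with parameter p."""
    N = len(chi) - 1
    return sum(math.comb(N, n) * p**n * (1 - p) ** (N - n) * chi[n]
               for n in range(N + 1))
```

Calling `low_bias_chi(N)` for a small N and evaluating `expected_summand` at the mesh points reproduces h(j/N), so the bias of each summand vanishes there by construction.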

Figure 1 illustrates a comparison for the Shannon case between the low-bias
estimator and other well-known ones for the simple example of a binary sequence of
N = 20 bits x = 0, 1 (i.e. M = 2), where the value 1 appears with probability p and
the value 0 with probability $1 - p$. The bias of the low-bias estimator vanishes exactly only at
values of p that are multiples of 1/20, and takes small values in between (see inset of Fig. 1). On
the other hand, the fluctuations for both the naive estimator and the one in [10] remain
bounded, while they diverge for the low-bias case (Fig. 1). This unbounded growth of
statistical fluctuations makes the low-bias estimator useless for practical purposes.

3. Balanced Estimator 

Aiming at solving the previously illustrated problem with uncontrolled statistical 
fluctuations, in this section we introduce a new balanced estimator designed to minimize 
simultaneously both the bias and the variance over a wide range of probabilities. This 
is of relevance for analyzing small data sets where statistical fluctuations are typically 
large and a compromise with minimizing the bias is required. 

As before, ignoring correlations between the $n_i$, both bias and statistical errors
can be optimized by minimizing the errors of the summands in their corresponding
expressions. Therefore, the problem can be reduced to minimizing the bias for each state

$$ \delta(p_i) = \langle \chi_{n_i} \rangle - h(p_i) \tag{28} $$

and the variance within such a state

$$ \sigma^2(p_i) = \left\langle \left( \chi_{n_i} - \langle \chi_{n_i} \rangle \right)^2 \right\rangle \tag{29} $$

over a broad range of $p_i \in [0, 1]$, where $n_i \in \{0, 1, \ldots, N\}$ is binomially distributed. Since
we are interested in a balanced compromise error, it is natural to minimize the squared
sum:

$$ \Phi^2(p_i) = \delta^2(p_i) + \sigma^2(p_i). \tag{30} $$

This quantity measures the total error for a particular value of $p_i$. Therefore, the average
error over the whole range of $p_i \in [0, 1]$ is given by:

$$ \overline{\Phi^2} = \int_0^1 dp_i\, w(p_i)\, \Phi^2(p_i), \tag{31} $$

where $w(p_i)$ is a suitable weight function that should be determined for each specific
problem.

We discuss explicitly here the simplest case $w(p_i) = 1$ (obviously, any extra
knowledge of the probability values should lead to a non-trivial distribution of weights,
resulting in improved results). Inserting Eq. (28) and Eq. (29) into Eq. (31), the
average error is given by:

$$ \overline{\Phi^2} = \int_0^1 dp_i \left[ \left( \sum_{n_i=0}^{N} P_{n_i}(p_i)\, \chi_{n_i}^2 \right) + h^2(p_i) - 2 h(p_i) \left( \sum_{n_i=0}^{N} P_{n_i}(p_i)\, \chi_{n_i} \right) \right]. \tag{32} $$

Now, we want to determine the numbers $\chi_{n_i}$ in such a way that the error given by
Eq. (32) is minimized. Before proceeding, let us make it clear that instead of minimizing
the mean-square error for each of the possible states (i = 1, ..., M) one could also
minimize the total mean-square error defined using Eq. (9) and (10) rather than Eq.
(28) and (29) to take into account correlations between boxes which, in general, will
improve the final result. For example, for binary sequences this can be easily done, and
leads to the same result as reported in what follows.

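As a necessary condition, the partial derivatives:

$$ \frac{\partial \overline{\Phi^2}}{\partial \chi_{n_i}} \tag{33} $$

have to vanish, i.e.:

$$ 2 \int_0^1 dp_i\, P_{n_i}(p_i) \left[ \chi_{n_i} - h(p_i) \right] = 0 \tag{34} $$

for all $n_i = 0, 1, \ldots, N$. Therefore, the balanced estimator is defined by the numbers:

$$ \chi_{n_i} = \frac{\int_0^1 dp_i\, P_{n_i}(p_i)\, h(p_i)}{\int_0^1 dp_i\, P_{n_i}(p_i)} = (N + 1) \int_0^1 dp_i\, P_{n_i}(p_i)\, h(p_i), \tag{35} $$

where we have explicitly integrated the binomial distribution over $p_i$.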
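
The factor N + 1 comes from the normalization integral of the binomial distribution over $p_i$. As a short worked step, using only the standard Beta integral $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$:

```latex
\begin{align*}
\int_0^1 dp\, P_{n}(p)
  &= \binom{N}{n} \int_0^1 p^{n} (1-p)^{N-n}\, dp
   = \binom{N}{n}\, B(n+1,\, N-n+1) \\
  &= \frac{N!}{n!\,(N-n)!} \cdot \frac{n!\,(N-n)!}{(N+1)!}
   = \frac{1}{N+1} .
\end{align*}
```

The result does not depend on n, which is why the same prefactor appears for every $\chi_{n_i}$.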

In the Shannon case, where $h(p_i) = -p_i \ln(p_i)$, the integration can be explicitly
carried out, leading to [22]:

$$ \chi_{n_i} = \frac{n_i + 1}{N + 2} \sum_{j=n_i+2}^{N+2} \frac{1}{j}, \tag{36} $$

so that the final result for the balanced estimator of Shannon entropy is given by:

$$ \hat{H}_S^{\mathrm{bal}} = \frac{1}{N + 2} \sum_{i=1}^{M} \left[ (n_i + 1) \sum_{j=n_i+2}^{N+2} \frac{1}{j} \right]. \tag{37} $$
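
In code, the balanced Shannon estimator amounts to a few lines. The sketch below is an illustration with hypothetical function names; it assumes that `counts` lists the number of occurrences of every one of the M possible states, including states that were never observed (a zero count still contributes to the sum):

```python
import math


def balanced_chi_shannon(n, N):
    """chi_n = (n + 1)/(N + 2) * sum_{j = n+2}^{N+2} 1/j."""
    return (n + 1) / (N + 2) * sum(1.0 / j for j in range(n + 2, N + 3))


def balanced_shannon_entropy(counts):
    """Balanced Shannon estimate from the counts n_i of all M states."""
    N = sum(counts)
    return sum(balanced_chi_shannon(n, N) for n in counts)


if __name__ == "__main__":
    # binary sequence of N = 20 symbols with 14 ones and 6 zeros
    print(balanced_shannon_entropy([14, 6]))
```

Unlike the naive estimator, an unobserved state ($n_i = 0$) gives a non-zero contribution, which reflects the averaging over all possible values of $p_i$.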



Similarly, it is possible to compute $\chi_{n_i}$ for a power $h(p_i) = p_i^q$, which is the basis for all
q-deformed entropies:

$$ \chi_{n_i}(q) = \frac{\Gamma(N + 2)\, \Gamma(n_i + 1 + q)}{\Gamma(N + 2 + q)\, \Gamma(n_i + 1)}. \tag{38} $$



Entropy estimates of small data sets 



Figure 2. Mean squared error $\Phi^2 = \langle (\hat{H} - H)^2 \rangle$ of different entropy estimators
(upper row: Shannon (left), Rényi with q = 1.5 (right); lower row: Tsallis with
q = 1.5) for a binary sequence of N = 20, as a function of p. Each panel compares the
estimator of Ref. [10], the naive estimator, and the balanced estimator. The set of possible values
is $\{x_1 = 1, x_2 = 0\}$ and the probabilities are $\{p_1 = p, p_2 = 1 - p\}$, respectively.



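The balanced estimators for Rényi [23] and Tsallis entropy are then given respectively
by:

$$ \hat{H}_R^{\mathrm{bal}}(q) = \frac{1}{1 - q} \ln\left[ \sum_{i=1}^{M} \frac{\Gamma(N + 2)\, \Gamma(n_i + 1 + q)}{\Gamma(N + 2 + q)\, \Gamma(n_i + 1)} \right] \tag{39} $$

and

$$ \hat{H}_T^{\mathrm{bal}}(q) = \frac{1}{q - 1} \left[ 1 - \sum_{i=1}^{M} \frac{\Gamma(N + 2)\, \Gamma(n_i + 1 + q)}{\Gamma(N + 2 + q)\, \Gamma(n_i + 1)} \right]. \tag{40} $$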
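
A possible implementation of these two estimators is sketched below. The function names are illustrative; the Gamma-function ratio is evaluated through `math.lgamma` to avoid overflow for larger N, which is a choice made here and not something prescribed by the text. As before, `counts` is assumed to contain an entry for each of the M possible states, and q must differ from 1.

```python
import math


def balanced_chi_power(n, N, q):
    """chi_n(q) = Gamma(N+2) Gamma(n+1+q) / (Gamma(N+2+q) Gamma(n+1))."""
    log_chi = (math.lgamma(N + 2) + math.lgamma(n + 1 + q)
               - math.lgamma(N + 2 + q) - math.lgamma(n + 1))
    return math.exp(log_chi)


def balanced_renyi_entropy(counts, q):
    """Balanced Renyi estimate: ln(sum_i chi_{n_i}(q)) / (1 - q)."""
    N = sum(counts)
    total = sum(balanced_chi_power(n, N, q) for n in counts)
    return math.log(total) / (1.0 - q)


def balanced_tsallis_entropy(counts, q):
    """Balanced Tsallis estimate: (1 - sum_i chi_{n_i}(q)) / (q - 1)."""
    N = sum(counts)
    total = sum(balanced_chi_power(n, N, q) for n in counts)
    return (1.0 - total) / (q - 1.0)


if __name__ == "__main__":
    print(balanced_renyi_entropy([14, 6], 1.5))
    print(balanced_tsallis_entropy([14, 6], 1.5))
```

For integer q the Gamma ratio reduces to a ratio of products; for example, q = 2 gives $(n_i + 1)(n_i + 2) / [(N + 2)(N + 3)]$.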



To illustrate the performance of these estimators, let us consider again a binary
sequence of N bits x = 0, 1 (i.e. M = 2) occurring with probabilities $1 - p$ and p
respectively. In Fig. 2 we plot the mean squared deviation $\Phi^2 = \langle (\hat{H} - H)^2 \rangle$ of various
estimators from the true value of the Shannon as well as the Rényi entropy as a function
of p. For such a short bit sequence, the performance of Grassberger's estimator,
as measured by $\Phi^2$, is even worse than that of the naive one. This is not surprising since
Grassberger's estimator is designed for small probabilities $p_i \ll 1$, while in the present
example one of the probabilities p or $1 - p$ is always large and thus the estimator is
affected by large fluctuations. The balanced estimator, however, reduces the mean squared
error considerably over an extended range of p, while for small p and 0.4 < p < 0.6 it
fails. Similar plots can be obtained for the Tsallis entropy.
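
A comparison of this kind can be reproduced for the Shannon case by exact enumeration, since for a binary sequence the counts follow a single binomial distribution. The sketch below (illustrative names, naive and balanced estimators only) computes the mean squared error as a function of p and its average over a uniform grid in p:

```python
import math


def h(x):
    """Shannon summand -x ln x, with the convention 0 ln 0 = 0."""
    return -x * math.log(x) if x > 0.0 else 0.0


def chi_naive(n, N):
    return h(n / N)


def chi_balanced(n, N):
    return (n + 1) / (N + 2) * sum(1.0 / j for j in range(n + 2, N + 3))


def mse_binary(chi, N, p):
    """Exact <(H_hat - H)^2> for a binary source with P(x = 1) = p."""
    true_entropy = h(p) + h(1.0 - p)
    total = 0.0
    for n in range(N + 1):
        weight = math.comb(N, n) * p**n * (1.0 - p) ** (N - n)
        estimate = chi(n, N) + chi(N - n, N)
        total += weight * (estimate - true_entropy) ** 2
    return total


def average_mse(chi, N, grid=400):
    """Midpoint-rule average of the mean squared error over p in (0, 1)."""
    return sum(mse_binary(chi, N, (k + 0.5) / grid) for k in range(grid)) / grid


if __name__ == "__main__":
    N = 20
    for p in (0.05, 0.2, 0.35, 0.5):
        print(p, mse_binary(chi_naive, N, p), mse_binary(chi_balanced, N, p))
    print(average_mse(chi_naive, N), average_mse(chi_balanced, N))
```

Only the error averaged uniformly over p is guaranteed to be smaller for the balanced estimator, because that average is the quantity it minimizes; at individual values of p either estimator may have the smaller error.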

The advantage of the balanced estimator compared to standard ones decreases
with increasing N. One reason is that the fluctuations of
the estimator are basically determined by the randomness of the $n_i$ and, therefore, are
difficult to reduce.

4. Conclusions 

We have designed a new "balanced estimator" for different entropy functionals
(Shannon, Rényi, and Tsallis) especially suited to the analysis of small data sets where
the possible states appear with not-too-small probabilities. To construct it, first we have
illustrated a known result establishing that systematic errors (bias) and statistical errors
cannot both be simultaneously reduced to arbitrarily small values when constructing an
estimator for a limited data set. In particular, we have designed a low-bias estimator
and highlighted that it leads to uncontrolled statistical fluctuations. This hinders the
practical usefulness of such a low-bias estimator.

On the other hand, we have designed a new estimator that constitutes a
good compromise between minimizing the bias and keeping statistical
fluctuations under control. We have illustrated how this balanced estimator outperforms (in reducing
simultaneously bias and fluctuations) previously available ones in special situations in which
the data sets are sufficiently small and the probabilities are not too small. Obviously,
situations such as in Fig. 2 are the 'worst case' for estimators like (13) and (16), which
were designed to be efficient for large M. If any of these conditions is not fulfilled,
Grassberger's and Schürmann's estimators remain the best choice.

The balanced method fills a gap in the list of existing entropy estimators, is easy to
implement for the Shannon, Rényi and Tsallis entropy functionals and therefore its potential
range of applicability is very large, especially in analyses of short biological (DNA, genes,
etc.) data series.

The balanced estimator proposed here is simple but by no means 'optimal' for two
reasons. First, we made no effort to optimize the location of the mesh points $p_j$, which
for simplicity are assumed to be equidistant. Moreover, we did not optimize the weights
$w(p_j)$ towards a Bayesian estimate, as e.g. attempted by Wolpert and Wolf [8]. Further
effort in this direction would be desirable.

We acknowledge financial support from the Spanish Ministerio de Educación y
Ciencia (FIS2005-00791) and Junta de Andalucía (FQM-165).